Basic Statistics in R

RAdelaide 2025

Author
Affiliation

Dr Stevie Pederson

Black Ochre Data Labs
Telethon Kids Institute

Published

July 9, 2025

Statistics in R

Introduction

  • R has its origins in the statistical analysis language S
  • Purpose of this session is NOT to teach statistical theory
    • I am a bioinformatician NOT a statistician
  • Perform simple analyses in R
  • Up to you to know what you’re doing
    • Or talk to your usual statisticians & collaborators

Distributions

  • R comes with nearly every distribution
  • Standard syntax for accessing each

Distributions

Distribution   Density    Area Under Curve   Quantile   Random
Normal         dnorm()    pnorm()            qnorm()    rnorm()
T              dt()       pt()               qt()       rt()
Uniform        dunif()    punif()            qunif()    runif()
Exponential    dexp()     pexp()             qexp()     rexp()
\(\chi^2\)     dchisq()   pchisq()           qchisq()   rchisq()
Binomial       dbinom()   pbinom()           qbinom()   rbinom()
Poisson        dpois()    ppois()            qpois()    rpois()

Distributions

  • Also Beta, \(\Gamma\), Log-Normal, F, Geometric, Cauchy, Hypergeometric etc…
?Distributions
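The four prefixes are consistent with each other: pnorm() accumulates dnorm(), and qnorm() inverts pnorm(). A quick base-R sketch:

```r
## qnorm() is the inverse of pnorm()
q <- qnorm(0.975)   # ~1.96, the familiar two-sided 5% cutoff
pnorm(q)            # back to 0.975

## rnorm() draws random values from the same distribution
set.seed(1)
rnorm(3, mean = 0, sd = 1)
```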

Distributions

## dnorm gives the classic bell-curve
tibble(
  x = seq(-4, 4, length.out = 1e3)
) %>% 
  ggplot(aes(x, y = dnorm(x))) + 
  geom_line(colour = "red")

## pnorm gives the area under the 
## bell-curve (which sums to 1)
tibble(
  x = seq(-4, 4, length.out = 1e3)
) %>% 
  ggplot(aes(x, y = pnorm(x))) + 
  geom_line()

The T Distribution

  • A T distribution looks very much like a Standard normal N(0, 1) but has heavier tails
  • This allows for greater uncertainty in the tails
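The heavier tails can be checked in one line by comparing densities well away from the centre (df = 5 chosen purely for illustration):

```r
## The t density (here with 5 df) exceeds the normal density in the tails
dt(3, df = 5) > dnorm(3)    # TRUE
## As df grows, the t distribution approaches the standard normal
dt(3, df = 100) / dnorm(3)
```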

Tests For Continuous Data

Data For This Session

We’ll use the pigs dataset from earlier

library(tidyverse)
library(scales)
library(car)
library(palmerpenguins)
theme_set(
  theme_bw() + theme(plot.title = element_text(hjust = 0.5))
)
pigs <- file.path("data", "pigs.csv") %>%
    read_csv %>%
    mutate(
      dose = fct(dose, levels = c("Low", "Med", "High")),
      supp = fct(supp)
    )

Data For This Session

pigs %>% 
  ggplot(
    aes(x = dose, y = len, fill = supp)
  ) +
    geom_boxplot()

Pop Quiz

  • A p-value is the probability of observing a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.
  • In plain English, assuming there’s nothing interesting to see here, how likely are we to observe our result, or one even more extreme
  • A p-value of 0.05 \(\implies\) about 1 in 20 times we’ll see something like this in a random sample
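This 1-in-20 behaviour can be checked by simulation. The sketch below (an illustration, not part of the session data) repeatedly tests samples for which \(H_0\) is genuinely true:

```r
## When H0 is true, p-values are uniformly distributed,
## so about 5% of tests return p < 0.05 by chance alone
set.seed(101)
p_vals <- replicate(1000, t.test(rnorm(10))$p.value)
mean(p_vals < 0.05)   # close to 0.05
```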

t-tests

  • Assumes normally distributed data
  • \(t\)-tests always test \(H_0\) Vs \(H_A\)
    • For data with exactly two groups

. . .

  • The simplest test is on a simple vector
    • Not particularly meaningful for our data
?t.test
t.test(pigs$len)

The default \(H_0\): the true mean of the underlying distribution from which the vector is sampled is zero, i.e. \(\mu = 0\)

t-tests

When comparing the means of two vectors

\[ H_0: \mu_{1} = \mu_{2} \\ H_A: \mu_{1} \neq \mu_{2} \]

We could use two vectors (i.e. x & y)

vc <- dplyr::filter(pigs, supp == "VC")$len
oj <- dplyr::filter(pigs, supp == "OJ")$len
t.test(x = vc, y = oj)

. . .

No

t-tests

  • An alternative is the R formula method: len~supp
    • Length is a response variable
    • Supplement is the predictor
  • Can only use one predictor for a t-test
    • Otherwise it’s linear regression
t.test(len~supp, data = pigs)

Did this give the same results?

t-tests

  • Do we think the variance is equal between the two groups?
pigs |> summarise(sd = sd(len), .by = supp)
# A tibble: 2 × 2
  supp     sd
  <fct> <dbl>
1 VC     8.27
2 OJ     6.61

. . .

  • We can use Levene’s Test to formalise this
    • From the package car
    • Bartlett’s test is very similar (bartlett.test())
leveneTest(len~supp, data = pigs)

t-tests

  • Now we can assume equal variances
    • By default, variances are assumed to be unequal
t.test(len~supp, data = pigs, var.equal = TRUE)

. . .

  • If relevant, the confidence interval can also be adjusted
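For example, the conf.level argument widens the interval to 99%. The sketch below uses R’s built-in ToothGrowth data, which shares the len ~ supp structure of pigs, so it runs without the CSV file:

```r
## Request a 99% confidence interval instead of the default 95%
res <- t.test(len ~ supp, data = ToothGrowth, var.equal = TRUE, conf.level = 0.99)
res$conf.int
attr(res$conf.int, "conf.level")  # 0.99
```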

Wilcoxon Tests

  • We assumed the above dataset was normally distributed:
    What if it’s not?

. . .

  • Non-parametric equivalent is the Wilcoxon Rank-Sum Test (aka Mann-Whitney)

. . .

  • This assigns a rank to each observation based on its value
    • The test is then performed on the ranks NOT the values
    • Tied values can be problematic
  • Tests that the centre of each underlying distribution is the same
wilcox.test(len~supp, data = pigs)

A Brief Comment

  • Both of these are suitable for comparing two groups
  • T-tests assume Normally Distributed Data underlies the random sample
    • Are robust to some deviation from normality
    • Data can sometimes be transformed (e.g. sqrt(), log() etc)

. . .

  • The Wilcoxon Rank-Sum Test assumes nothing about the underlying distribution
    • Much less powerful with small sample sizes
    • Highly comparable at n \(\geq\) 30
  • The package coin implements a range of non-parametric tests

Tests For Categorical Data

\(\chi^2\) Test

  • Here we need counts and categories
  • Commonly used in Observed Vs Expected

\[ H_0: \text{No association between groups and outcome}\\ H_A: \text{Association between groups and outcome} \]

Valid when all expected cell counts are > 5 (Cochran 1954)

\(\chi^2\) Test

pass <- matrix(
  c(25, 8, 6, 15), nrow = 2, 
  dimnames = list(
    c("Attended", "Skipped"), 
    c("Pass", "Fail"))
)
pass
         Pass Fail
Attended   25    6
Skipped     8   15

. . .


pass_chisq <- chisq.test(pass)
pass_chisq

    Pearson's Chi-squared test with Yates' continuity correction

data:  pass
X-squared = 9.8359, df = 1, p-value = 0.001711
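The expected counts behind Cochran’s rule are stored on the test result, so the assumption can be checked directly (the matrix is rebuilt here so the snippet stands alone):

```r
## Rebuild the 2x2 table of counts
pass <- matrix(
  c(25, 8, 6, 15), nrow = 2,
  dimnames = list(c("Attended", "Skipped"), c("Pass", "Fail"))
)
## All expected counts are comfortably > 5
chisq.test(pass)$expected
```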

Fisher’s Exact Test

  • \(\chi^2\) tests became popular in the days of printed tables
    • We now have computers
  • Fisher’s Exact Test is preferable in the cases of low cell counts
    • (Or any other time you feel like it…)
  • Same \(H_0\) as the \(\chi^2\) test
  • Uses the hypergeometric distribution
fisher.test(pass)

Summary of Tests

  • t.test(), wilcox.test()
  • chisq.test(), fisher.test()

. . .

  • shapiro.test(), bartlett.test()
  • car::leveneTest()
    • Tests for normality or homogeneity of variance

. . .

  • binom.test(), poisson.test()
  • kruskal.test(), ks.test()

htest Objects

  • All produce objects of class htest
  • Is really a list
    • Use names() to see what other values are returned
names(pass_chisq)

. . .

  • Will vary slightly between tests
  • Can usually extract p-values using test$p.value
pass_chisq$p.value

htest Objects

## Have a look at the list elements produced by fisher.test
fisher.test(pass) |> names()
[1] "p.value"     "conf.int"    "estimate"    "null.value"  "alternative"
[6] "method"      "data.name"  

. . .


## Are these similar to those produced by t.test?
t.test(len~supp, data = pigs) |> names()
 [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
 [6] "null.value"  "stderr"      "alternative" "method"      "data.name"  

. . .

  • There is a function print.htest() which organises the printout for us
  • We’ll come back to methods later, but this is a common way for output to be produced
  • There will be a print function for objects of each class, i.e. print.class_of_object

Linear Regression

Linear Regression

We are trying to estimate a line with slope & intercept

\[ y = ax + b \]

. . .

Or

\[ y = \beta_0 + \beta_1 x \]

  • \(y\) is the response variable
  • \(x\) is the predictor variable
  • Makes the most intuitive sense when both \(x\) & \(y\) are continuous
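A minimal simulation (not part of the session data) shows lm() recovering a known slope and intercept:

```r
## Simulate y = 2 + 3x + noise, then fit the model
set.seed(42)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, sd = 0.5)
coef(lm(y ~ x))  # estimates close to 2 and 3
```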

Linear Regression

Linear Regression always uses the R formula syntax

  • y ~ x: y depends on x
  • We use the function lm()
  • Once we have our model, we can predict \(y\) based on \(x\) values

. . .

  • We’ll use the penguins dataset for some exploration

Linear Regression

## Does bill length depend on body mass?
## Plot the predictor (body mass) on `x`
## The response (bill_length) goes on `y`
penguins |>
  ggplot(
    aes(body_mass_g, bill_length_mm)
  ) +
  geom_point() +
  geom_smooth(method = "lm")

  • There looks like a fairly clear linear relationship

Linear Regression

  • Bill length is the response variable (\(y\))
  • Body Mass is the predictor variable (\(x\))
## Fit a linear regression model and save the output as an R object
bill_length_lm <- lm(bill_length_mm ~ body_mass_g, data = penguins)
bill_length_lm

Call:
lm(formula = bill_length_mm ~ body_mass_g, data = penguins)

Coefficients:
(Intercept)  body_mass_g  
  26.898872     0.004051  

Linear Regression

  • The basic output is only minimally informative
  • Often pass to the function summary() to present and interpret
## Use `summary` to view and interpret the fitted model
summary(bill_length_lm)

Call:
lm(formula = bill_length_mm ~ body_mass_g, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.1251  -3.0434  -0.8089   2.0711  16.1109 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.690e+01  1.269e+00   21.19   <2e-16 ***
body_mass_g 4.051e-03  2.967e-04   13.65   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.394 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.3542,    Adjusted R-squared:  0.3523 
F-statistic: 186.4 on 1 and 340 DF,  p-value: < 2.2e-16

Linear Regression

  1. The code used is returned as Call:
  2. A brief summary of the residuals are returned
  3. The fitted values are returned in the Coefficients element
    • We have the estimate of the intercept & slope
    • The standard error of the estimates
    • A t-test testing \(H_0: \beta_i = 0\)
    • The p-value for each t-test
  4. Additional model summary information
    • \(R^2\) is the proportion of variance explained by the model
    • Adjusted R-squared is \(R^2\) adjusted for the number of predictors
    • Mostly relevant when comparing models where predictors vary

Coefficients

How do we interpret the Intercept?

This is the expected bill length for a penguin weighing exactly 0 g

  • We might expect this to be zero, but fitted intercepts almost never are
  • The Intercept is almost always statistically significant
  • What does this really tell us about the relationship between bill length and weight?
  • Generally focussed on the relationship within the range of observed predictors
    • No guaranteed linear relationship outside of this range

Coefficients

How do we interpret the body_mass_g term?

This is how much we would expect the bill length to change for every one unit increase in the predictor
i.e. for every 1\(g\) increase in weight, the bill length would be expected to increase by about 0.0041mm

  • The \(t\)-test here is highly relevant
    • \(H_0\colon \beta_1 = 0~\) Vs \(~H_A\colon \beta_1 \neq 0\)
  • Reject \(H_0 \implies\) there is an association between predictor & response

Linear Regression

  • Points never lie exactly on the regression line \(\implies\) Why?

. . .

  • We’re actually fitting the model

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]

  • \(\beta_0 + \beta_1 x_i\) is the exact line co-ordinate (Intercept + slope*predictor)
  • \(\epsilon_i\) is the vertical difference between the observed value and the fitted value
    • Known as a residual
    • Defined as \(\epsilon_i \sim \mathcal{N}(0, \sigma)\)
  • Residual means the bit left over when we subtract the predicted value from the observed
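This identity can be verified on any fitted model; the sketch below uses the built-in cars data so it stands alone:

```r
## residuals = observed - fitted, exactly
fit <- lm(dist ~ speed, data = cars)
all.equal(
  unname(resid(fit)),
  cars$dist - unname(fitted(fit))
)  # TRUE
```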

Linear Regression

Linear Regression formally has 4 key assumptions

  1. Linear relationship between predictor and response
    • Mean of residuals is zero across the entire range
  2. Constant variance across the range of data (homoscedasticity)
  3. Residuals are normally distributed
  4. Independence of errors
  • Three of these are represented in the definition \(\epsilon_i \sim \mathcal{N}(0, \sigma)\)

Linear Regression

  • To check our fitted model, we should check the residuals to ensure \(\epsilon_i \sim \mathcal{N}(0, \sigma)\)
## Check the residuals. It will ask you the following
## Hit <Return> to see next plot: 
## So follow the instructions and check each plot before proceeding
## There are 4 plots by default
plot(bill_length_lm)

Linear Regression


  • Check the zero mean of \(\mathcal{N}(0, \sigma)\)
  • Is this assumption satisfied across the range of the data?

Linear Regression


  • Check the normality of \(\mathcal{N}(0, \sigma)\)
  • The dashed line is the expected line from a normal distribution
  • Is this assumption satisfied across the range of the data?

Linear Regression


  • Check the constant variance of \(\mathcal{N}(0, \sigma)\)
  • Is this assumption satisfied across the range of the data?

Linear Regression


  • Checks if any points are exerting excessive ‘leverage’ on the model
  • Beyond scope of today

Regression Diagnostic Plots

  • All of these figures used base plotting functions
  • Stepping through can be frustrating
    • Especially when running an automated script

. . .

  • Can plot all 4 at once using a simple trick
## Define a 2x2 grid, plotting figures in a row-wise manner
par(mfrow = c(2, 2))
## Plot the four regression diagnostic plots
plot(bill_length_lm)
## Reset the grid layout to be the default 1x1
par(mfrow = c(1, 1))
  • par() is a function to set graphical parameters
  • mfrow means ‘Multiple Figures, By Row’
  • c(2, 2) means plot in a 2x2 grid, filled row-wise
  • I rarely show these so no need for ggplot versions

Objects Of Class lm

  • The linear model we fitted produced an object of class lm
class(bill_length_lm)

. . .

  • This is also a list
## Inspect the actual object
glimpse(bill_length_lm)

Objects Of Class lm

  • We can use the list structure to inspect the residuals manually
## Plot a histogram of the residuals
hist(bill_length_lm$residuals)

Objects Of Class lm

  • We could even use the Shapiro-Wilk test for normality
shapiro.test(bill_length_lm$residuals)

    Shapiro-Wilk normality test

data:  bill_length_lm$residuals
W = 0.95439, p-value = 8.217e-09

. . .

  • How can we interpret all of this?
  • Maybe there’s a better model

Adding Terms

## Does bill length depend on body mass?
## Plot the predictor (body mass) on `x`
## The response (bill_length) goes on `y`
## Does each species need its own model?
penguins |>
  ggplot(
    aes(
      body_mass_g, bill_length_mm, 
      colour = species
    )
  ) +
  geom_point() +
  geom_smooth(method = "lm")

  • Do we think these slopes are the same?
  • Do we think the intercepts are the same?

Adding Terms

## Include the species in the model
## NB: This will fit a separate intercept for each species
bill_length_sp_lm <- lm(bill_length_mm ~ species + body_mass_g, data = penguins) 
summary(bill_length_sp_lm)

Call:
lm(formula = bill_length_mm ~ species + body_mass_g, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.8129 -1.6718  0.1336  1.4720  9.2902 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      2.492e+01  1.063e+00  23.443  < 2e-16 ***
speciesChinstrap 9.921e+00  3.511e-01  28.258  < 2e-16 ***
speciesGentoo    3.558e+00  4.858e-01   7.324 1.78e-12 ***
body_mass_g      3.749e-03  2.823e-04  13.276  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.403 on 338 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.808, Adjusted R-squared:  0.8063 
F-statistic:   474 on 3 and 338 DF,  p-value: < 2.2e-16
  • There is already a large increase in R-squared, so this model may fit better

Adding Terms

## Check the residuals after 
## including a separate intercept
## for species
par(mfrow = c(2, 2))
plot(bill_length_sp_lm)
par(mfrow = c(1, 1))

Model Diagnostics

## Check the histogram of residuals
hist(bill_length_sp_lm$residuals, breaks = 20)
## Perform the Shapiro-Wilk test for Normality
shapiro.test(bill_length_sp_lm$residuals)

    Shapiro-Wilk normality test

data:  bill_length_sp_lm$residuals
W = 0.99317, p-value = 0.123

Interpreting the Coefficients

  • Now we’re happier with the model \(\rightarrow\) what do the coefficients mean?
                     Estimate   Std. Error   t value     Pr(>|t|)
(Intercept)      24.919470977 1.0630034684 23.442511 7.632983e-73
speciesChinstrap  9.920884113 0.3510790185 28.258265 5.093822e-91
speciesGentoo     3.557977539 0.4857896978  7.324111 1.776921e-12
body_mass_g       0.003748497 0.0002823439 13.276352 1.158990e-32

. . .

  • By default, the primary Intercept is the first factor level in species
levels(penguins$species)
[1] "Adelie"    "Chinstrap" "Gentoo"   

Interpreting the Coefficients

  • Now we’re happier with the model \(\rightarrow\) what do the coefficients mean?
                     Estimate   Std. Error   t value     Pr(>|t|)
(Intercept)      24.919470977 1.0630034684 23.442511 7.632983e-73
speciesChinstrap  9.920884113 0.3510790185 28.258265 5.093822e-91
speciesGentoo     3.557977539 0.4857896978  7.324111 1.776921e-12
body_mass_g       0.003748497 0.0002823439 13.276352 1.158990e-32
  • The baseline intercept is for Adelie penguins
  • Additional intercept terms are the differences between the baseline and each species
    • Does this check-out in our initial plot of the data?
  • They all appear significant when checking each \(H_0\)

Adding Terms

  • Do we think each species may have a different relationship between mass and bill length?
    • Do we need to fit a separate slope for each species?
  • This is done in R using an “interaction term” separating terms by :
    • species:body_mass_g
bill_length_int_lm <- lm(
  bill_length_mm ~ species + body_mass_g + species:body_mass_g, 
  data = penguins
) 
summary(bill_length_int_lm)
  • Now what do we think?

Interpreting the Coefficients

  • The baseline terms for the Intercept and body_mass_g are now both for Adelie
  • Differences in slope for each species are provided as speciesChinstrap:body_mass_g and speciesGentoo:body_mass_g

. . .

  • The regression line for Adelie is y = 26.99 + 0.0032*body_mass_g

. . .

  • For Chinstrap: y = (26.99+5.18) + (0.0032+0.0013)*body_mass_g

Interpreting the Coefficients

  • The model matrix \(X\) is a key part of understanding statistics
  • The model coefficients (\(\hat{\beta}\)) are actually estimated by performing linear algebra on this
  • The least squares solution is \(\hat{\beta} = (X^TX)^{-1}X^Ty\)

. . .

y <- penguins$bill_length_mm[!is.na(penguins$body_mass_g)]
X <- model.matrix(~ species + body_mass_g + species:body_mass_g, data = penguins)

Interpreting the Coefficients

head(X)
  (Intercept) speciesChinstrap speciesGentoo body_mass_g
1           1                0             0        3750
2           1                0             0        3800
3           1                0             0        3250
5           1                0             0        3450
6           1                0             0        3650
7           1                0             0        3625
  speciesChinstrap:body_mass_g speciesGentoo:body_mass_g
1                            0                         0
2                            0                         0
3                            0                         0
5                            0                         0
6                            0                         0
7                            0                         0

Interpreting the Coefficients

## No need to save this, I'm just demonstrating a point
## The inverse of a matrix is found using `solve()`
## Matrix multiplication is performed using %*%
## The transpose of a matrix is found using `t()`
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
cbind(beta_hat, coef(bill_length_int_lm))
                                      [,1]          [,2]
(Intercept)                  26.9941391366 26.9941391367
speciesChinstrap              5.1800537287  5.1800537287
speciesGentoo                -0.2545906615 -0.2545906615
body_mass_g                   0.0031878758  0.0031878758
speciesChinstrap:body_mass_g  0.0012748183  0.0012748183
speciesGentoo:body_mass_g     0.0009029956  0.0009029956

Interpreting the Coefficients

  • An excellent primer on model matrices is here1
  • Conventional statistics always sets a baseline level for the intercept
    • Performs a \(t\)-test automatically for differences
    • Really quite convenient if less obvious
  • We could remove the common intercept using ~ 0 + species + body_mass_g.
    • Would return a separate intercept by removing the common (baseline) level
    • No real advantage except non-statisticians like it
  • No statistical evidence provided for differences in intercepts
    \(\implies\) have to test ourselves 😢
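A sketch of the ~ 0 parameterisation, using the built-in iris data in place of penguins so it runs stand-alone:

```r
## Without the common intercept, each species gets its own intercept estimate
coef(lm(Sepal.Length ~ 0 + Species + Sepal.Width, data = iris))
## Compare with the default baseline parameterisation
coef(lm(Sepal.Length ~ Species + Sepal.Width, data = iris))
```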

Model Diagnostics

## Check the residuals after 
## including a separate intercept
## and slope for species
par(mfrow = c(2, 2))
plot(bill_length_int_lm)
par(mfrow = c(1, 1))

Model Selection

  • How do we decide on the best model?
  • A common technique is Analysis of Variance (ANOVA)
  • Classic ANOVA checks importance of each term within a model
## Run a separate ANOVA on each model
anova(bill_length_lm)
anova(bill_length_sp_lm)
anova(bill_length_int_lm)

. . .

  • Does this give any clue as to the best model?

Model Selection

  • We have progressively added terms to each model
  • Can use ANOVA to compare suitability of each model
## Compare the models using ANOVA
## If a model is a significant improvement, the p-value from the 
## F test will be clearly significant
anova(
  bill_length_lm,
  bill_length_sp_lm,
  bill_length_int_lm
)

. . .

  • It looks like the separate slopes are not an improvement
  • The separate intercepts are an improvement

Speeding The Process Up

  • This was a careful breakdown of finding the best model
  • We can partially automate this and use some shortcuts

. . .

  • The shorthand for an interaction term with separate intercepts is *
## Refit the model with an interaction term using the shorthand code
bill_length_int_lm <- lm(
  bill_length_mm ~ species * body_mass_g, data = penguins
)
summary(bill_length_int_lm)

. . .

  • Alternatively, all terms can be placed inside brackets & raised to a power
    • (species + body_mass_g)^2 would give two-way interactions

Speeding The Process Up

  • After specifying a ‘full’ model \(\implies\) use step() to remove redundant terms
    • Removes terms in a step-wise manner
    • Uses Akaike’s Information Criterion (AIC) to determine the optimal model
    • Finds the model with the lowest AIC
  • The AIC is a function of the number of model terms and the log-likelihood
    • Beyond the scope of today
    • Doesn’t perform well with highly correlated predictors
step(bill_length_int_lm)
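AIC values can also be compared by hand using AIC(); a sketch with the built-in mtcars data:

```r
## Lower AIC indicates the preferred model
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
AIC(m1, m2)
```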

Objects of Class summary.lm

  • The coefficients element of the basic lm object only had the fitted values
    • Not the std errors, t-tests or p-values
## Extract the coefficients directly from the linear model
bill_length_sp_lm$coefficients

. . .

  • These values are produced by the function summary()
## Extract the coefficients from the summary object
summary(bill_length_sp_lm)$coefficients

Objects of Class summary.lm

## Check the class of the output from summary
summary(bill_length_sp_lm) |> class()
## Prove conclusively that it is really a list
## with a class attribute stuck on it
summary(bill_length_sp_lm) |> is.list()
## See what attributes it has
summary(bill_length_sp_lm) |> attributes()
## Check the full details
summary(bill_length_sp_lm) |> glimpse()

. . .

  • The full complexity of the object is mostly irrelevant

Using A More Tidyverse Friendly Approach

  • The function tidy() from the package broom is a catch-all function
    • Will return a tibble
    • Returns the same from lm and summary.lm objects
bill_length_sp_lm |> broom::tidy()

Adding Significance Stars

  • The easiest way for me is a case_when()
bill_length_sp_lm |> 
  broom::tidy() |>
  mutate(
    stars = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01 ~ "**",
      p.value < 0.05 ~ "*",
      p.value < 0.1 ~ ".",
      TRUE ~ ""
    )
  )

Or Fitting a Model On The Fly

  • Obviously no room for checking model diagnostics
penguins |>
  ## Piped data can be recalled using `_`
  lm( bill_length_mm ~ species * body_mass_g, data = _) |> 
  step() |> # Fit the best model using the AIC
  broom::tidy() |> # Turn the output into a tibble & add stars
  mutate(
    stars = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01 ~ "**",
      p.value < 0.05 ~ "*",
      p.value < 0.1 ~ ".",
      TRUE ~ ""
    )
  )

S3 Method Dispatch

S3 Method Dispatch

  • Objects with class lm and summary.lm are S3 objects
  • Very informal class structure in R
  • Easy to work with \(\implies\) easy to break

. . .

  • When we printed these objects \(\implies\) print.lm() or print.summary.lm()
  • Likewise when we plotted the lm object \(\implies\) plot.lm()

. . .

?plot.lm

S3 Method Dispatch

  • If we only want a subset of figures
par(mfrow = c(1, 2))
plot(
  bill_length_sp_lm, which = c(1, 3),
  caption = list(
    "Separate Intercepts: Residuals vs Fitted", NULL,
    "Separate Intercepts: Scale-Location"
  )
)

par(mfrow = c(1, 1))

S3 Method Dispatch

  • We can also check what we pass to broom::tidy()
?broom::tidy.lm

. . .

  • broom::tidy() has multiple versions for objects of different classes
    • e.g. broom::tidy.prcomp() for PCA results

S3 Method Dispatch

  • S3 objects can be easily broken
broken_lm <- bill_length_sp_lm
class(broken_lm) <- "htest"
broken_lm

. . .

  • R looked for the print.htest method
  • The structure didn’t match what was expected \(\implies\) nonsense output

Confidence Intervals

  • For confidence or prediction intervals, we can use predict.lm()
  • 95% Confidence Interval: across repeated samples, the interval will contain the true mean 95% of the time
  • 95% Prediction Interval: a new observation will fall within the interval 95% of the time
  • We will need a new data frame to make the predictions about
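The distinction shows up in the interval widths; a self-contained sketch with the built-in cars data:

```r
## Prediction intervals are always wider than confidence intervals
fit <- lm(dist ~ speed, data = cars)
nd <- data.frame(speed = 15)
conf_int <- predict(fit, newdata = nd, interval = "confidence")
pred_int <- predict(fit, newdata = nd, interval = "prediction")
conf_int
pred_int  # same fitted value, but wider bounds
```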

Confidence Intervals

## Create some new penguins
new_penguins <- tibble(
  species = c("Adelie", "Gentoo"), 
  body_mass_g = 4500
)
## Predict their mean bill length
predict(bill_length_sp_lm, newdata = new_penguins, interval = "confidence")
## Predict the range a new observation may lie in
predict(bill_length_sp_lm, newdata = new_penguins, interval = "prediction")

Confidence Intervals

  • We may wish to include our new penguins in these results
## Predict their mean bill length, but include the original data
new_penguins |>
  bind_cols(
    predict(bill_length_sp_lm, newdata = new_penguins, interval = "confidence")
  )

Confidence Intervals

  • Confidence Intervals for model terms can also be found using broom::tidy()
## Use broom::tidy to find confidence intervals for coefficients
bill_length_sp_lm |>
  broom::tidy(conf.int = TRUE) 

Closing Comments

Additional Plotting Comments

  • I rarely show diagnostic plots:
    • They’re for me when fitting models & performing analysis
    • So there’s no real need for ggplot versions

tibble(
  fitted= fitted(bill_length_sp_lm), 
  residuals = resid(bill_length_sp_lm)
) |> 
  ggplot(aes(fitted, residuals)) + 
  geom_point() + 
  geom_smooth(
    se = FALSE, colour = 'red', 
    linewidth = 0.5
    ) +
  ggtitle("Residuals vs Fitted") +
  labs(
    x = "Fitted Values", y = "Residuals"
  ) 

tibble(
  residuals = rstandard(bill_length_sp_lm)
) |> 
  ggplot(aes(sample = residuals)) + 
  geom_qq() +
  geom_qq_line() +
  ggtitle("Q-Q Residuals") +
  labs(
    x = "Theoretical Quantiles", 
    y = "Standardized Residuals"
  ) 

Additional Plotting Packages

  • Multiple options for great looking plots
    • ggpmisc for adding correlations, regression equations etc
    • ggstats for multiple fresh perspectives on coefficients
    • ggstatsplot also a wide range of plotting capabilities

Challenges

  1. How could we account for sex in the existing penguins model?
  2. Load the pigs dataset
  3. Make a boxplot:
    • len will be the response (\(y\))
    • Place dose on the \(x\)-axis, using supp to fill the boxes
  4. Find the best model for len, using supp and dose as the predictors
  5. Check the residuals and diagnostic plots
  6. Make sure you can interpret the model coefficients
  7. What is the 99% confidence interval for supp = "VC" & dose = "High"
    • What does this mean?

References

Cochran, William G. 1954. “Some Methods for Strengthening the Common χ 2 Tests.” Biometrics 10 (4): 417.

Footnotes

  1. https://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/designmatrices.html↩︎